MapReduce-Style Computation in Distributed Virtual Memory
Abstract
Many cloud computing technologies, such as MapReduce, use file systems as the system-wide substrate for data handling. A distributed file system provides a global name space and stores data persistently, but it also introduces significant overhead. Several recent systems store data in DRAM and thereby greatly improve the performance of cloud computing systems. However, both our own experience and related work indicate that simply substituting distributed DRAM for the file system does not provide a solid and viable foundation for data storage and processing in the datacenter environment, and the capacity of such systems is limited by the amount of physical memory in the cluster. To overcome this challenge, we construct a distributed virtual memory that unifies the physical memory and disk resources of many compute nodes into a system-wide data substrate. The new substrate provides a general memory-based abstraction, takes advantage of DRAM in the system to accelerate computation, and, transparently to programmers, scales the system to handle large datasets by swapping data to disks and remote servers. On top of the distributed virtual memory, we develop vMR, a new design of the MapReduce framework with enhanced functionality enabled by the memory-based abstraction. The evaluation results show that vMR is faster than Hadoop on all four workloads and delivers 6-11x speedups on the adjacency list workload. For iterative k-means clustering, vMR can run more than 7 times faster than Hadoop and scale linearly to 160 compute nodes on the TH-1/GZ supercomputer.
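The abstract does not describe vMR's programming interface, so the following is only an illustrative sketch of the kind of job the adjacency list workload represents: a map phase that emits one (source, destination) pair per edge, an in-memory shuffle, and a reduce phase that collects each vertex's neighbors. It runs in a single process, with a plain Python dictionary standing in for a memory-based data substrate. All function names and the sample edge list are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def map_edge(edge):
    """Map phase: emit (source, destination) for one directed edge."""
    src, dst = edge
    yield src, dst


def reduce_neighbors(src, dsts):
    """Reduce phase: return the sorted, de-duplicated neighbors of one vertex."""
    return src, sorted(set(dsts))


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle, and reduce entirely in memory (hypothetical helper)."""
    # Shuffle: group intermediate values by key in an in-memory dict
    # instead of writing them to intermediate files.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Hypothetical input: directed edges, including one duplicate.
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (1, 2)]
adjacency = run_mapreduce(edges, map_edge, reduce_neighbors)
print(adjacency)  # {1: [2, 3], 2: [3], 3: [1]}
```

Keeping the intermediate groups in memory is what removes the file system from the data path in this toy version; it says nothing about how vMR itself partitions data or swaps it to disks and remote servers.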
Similar resources
MPI for Big Data: New tricks for an old dog
The processing of massive amounts of data on clusters with finite amount of memory has become an important problem facing the parallel/distributed computing community. While MapReduce-style technologies provide an effective means for addressing various problems that fit within the MapReduce paradigm, there are many classes of problems for which this paradigm is ill-suited. In this paper we pres...
Cross-cloud MapReduce for Big Data
MapReduce plays a critical role as a leading framework for big data analytics. In this paper, we consider a geo-distributed cloud architecture that provides MapReduce services based on the big data collected from end users all over the world. Existing work handles MapReduce jobs by a traditional computation-centric approach that all input data distributed in multiple clouds are aggregated to a v...
Cross-cloud MapReduce for Big Data (IEEE Transactions on Cloud Computing)
MapReduce plays a critical role as a leading framework for big data analytics. In this paper, we consider a geo-distributed cloud architecture that provides MapReduce services based on the big data collected from end users all over the world. Existing work handles MapReduce jobs by a traditional computation-centric approach that all input data distributed in multiple clouds are aggregated to a v...
A Distributed Solver for Massive Scale Resource Allocation Linear Programs
The present paper focuses on the problem of solving terabyte-sized LPs on an hourly basis given a distributed computational infrastructure; solving these massive LPs is the computational primitive required for yield management problems arising in online advertising. Here we design a linear optimization algorithm borrowing from the multiplicative weights framework of Plotkin, Shmoys and Tardos t...
Scalable Score Computation for Learning Multinomial Bayesian Networks over Distributed Data
In this paper, we focus on the problem of learning a Bayesian network over distributed data stored in a commodity cluster. Specifically, we address the challenge of computing the scoring function over distributed data in a scalable manner, which is a fundamental task during learning. We propose a novel approach designed to achieve: (a) scalable score computation using the principle of gossiping...
Publication date: 2013